data category
Understanding the Influence of Synthetic Data for Text Embedders
Springer, Jacob Mitchell, Adlakha, Vaibhav, Reddy, Siva, Raghunathan, Aditi, Mosbach, Marius
Recent progress in developing general-purpose text embedders has been driven by training on ever-growing corpora of synthetic LLM-generated data. Nonetheless, no publicly available synthetic dataset exists, posing a barrier to studying its role for generalization. To address this issue, we first reproduce and publicly release the synthetic data proposed by Wang et al. (Mistral-E5). Our synthetic data is high quality and leads to consistent improvements in performance. Next, we critically examine where exactly synthetic data improves model generalization. Our analysis reveals that benefits from synthetic data are sparse and highly localized to individual datasets. Moreover, we observe trade-offs between performance on different task categories: data that benefits one task can degrade performance on another. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders and challenge the notion that training on synthetic data leads to more robust embedding models across tasks.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > Dominican Republic (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- (6 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.47)
From On-chain to Macro: Assessing the Importance of Data Source Diversity in Cryptocurrency Market Forecasting
Demosthenous, Giorgos, Georgiou, Chryssis, Polydorou, Eliada
This study investigates the impact of data source diversity on the performance of cryptocurrency forecasting models by integrating various data categories, including technical indicators, on-chain metrics, sentiment and interest metrics, traditional market indices, and macroeconomic indicators. We introduce the Crypto100 index, representing the top 100 cryptocurrencies by market capitalization, and propose a novel feature reduction algorithm to identify the most impactful and resilient features from diverse data sources. Our comprehensive experiments demonstrate that data source diversity significantly enhances the predictive performance of forecasting models across different time horizons. Key findings include the paramount importance of on-chain metrics for both short-term and long-term predictions, the growing relevance of traditional market indices and macroeconomic indicators for longer-term forecasts, and substantial improvements in model accuracy when diverse data sources are utilized. These insights help demystify the short-term and long-term driving factors of the cryptocurrency market and lay the groundwork for developing more accurate and resilient forecasting models.
Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification
Eshuijs, Leon, Wang, Shihan, Fokkens, Antske
Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
- (3 more...)
- Media > Film (0.49)
- Leisure & Entertainment (0.49)
Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with Crowdsourcing and Active Learning
Qiu, Wenjun, Lie, David, Austin, Lisa
A significant challenge to training accurate deep learning models on privacy policies is the cost and difficulty of obtaining a large and comprehensive set of training data. To address these challenges, we present Calpric, which combines automatic text selection and segmentation, active learning, and the use of crowdsourced annotators to generate a large, balanced training set for privacy policies at low cost. Automated text selection and segmentation simplifies the labeling task, enabling untrained annotators from crowdsourcing platforms, like Amazon's Mechanical Turk, to be competitive with trained annotators, such as law students, and also reduces inter-annotator disagreement, which decreases labeling cost. Having reliable labels for training enables the use of active learning, which uses fewer training samples to efficiently cover the input space, further reducing cost and improving class and data category balance in the data set. The combination of these techniques allows Calpric to produce models that are accurate over a wider range of data categories and provide more detailed, fine-grain labels than previous work. Our crowdsourcing process enables Calpric to attain reliable labeled data at a cost of roughly $0.92-$1.71 per labeled text segment. Calpric's training process also generates a labeled data set of 16K privacy policy text segments across 9 data categories with balanced positive and negative samples.
- North America > Canada > Ontario > Toronto (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- (11 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.93)
Conversational Financial Information Retrieval Model (ConFIRM)
Choi, Stephen, Gazeley, William, Wong, Siu Ho, Li, Tingting
With the exponential growth in large language models (LLMs), leveraging their emergent properties for specialized domains like finance merits exploration. However, regulated fields such as finance pose unique constraints, requiring domain-optimized frameworks. We present ConFIRM, an LLM-based conversational financial information retrieval model tailored for query intent classification and knowledge base labeling. ConFIRM comprises two modules: 1) a method to synthesize finance domain-specific question-answer pairs, and 2) an evaluation of parameter-efficient fine-tuning approaches for the query classification task. We generate a dataset of over 4,000 samples, assessing accuracy on a separate test set. ConFIRM achieved over 90% accuracy, essential for regulatory compliance. ConFIRM provides a data-efficient solution to extract precise query intent for financial dialog systems.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Spain (0.04)
- Asia > China > Hong Kong (0.04)
- Law (1.00)
- Government (1.00)
- Banking & Finance > Economy (0.94)
- Banking & Finance > Trading (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Data-centric Operational Design Domain Characterization for Machine Learning-based Aeronautical Products
Kaakai, Fateh, Adibhatla, Shridhar "Shreeder", Pai, Ganesh, Escorihuela, Emmanuelle
We give a first rigorous characterization of Operational Design Domains (ODDs) for Machine Learning (ML)-based aeronautical products. Unlike in other application sectors (such as self-driving road vehicles) where ODD development is scenario-based, our approach is data-centric: we propose the dimensions along which the parameters that define an ODD can be explicitly captured, together with a categorization of the data that ML-based applications can encounter in operation, whilst identifying their system-level relevance and impact. Specifically, we discuss how those data categories are useful to determine: the requirements necessary to drive the design of ML Models (MLMs); the potential effects on MLMs and higher levels of the system hierarchy; the learning assurance processes that may be needed, and system architectural considerations. We illustrate the underlying concepts with an example of an aircraft flight envelope.
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
- North America > United States > Ohio > Hamilton County > Cincinnati (0.04)
- Transportation > Air (1.00)
- Government > Regional Government > North America Government > United States Government (0.93)
- Aerospace & Defense > Aircraft (0.69)
- Automobiles & Trucks (0.67)
Firmware implementation of a recurrent neural network for the computation of the energy deposited in the liquid argon calorimeter of the ATLAS experiment
Aad, Georges, Calvet, Thomas, Chiedde, Nemer, Faure, Robert, Fortin, Etienne Marie, Laatu, Lauri, Monnier, Emmanuel, Sur, Nairit
The ATLAS experiment measures the properties of particles that are products of proton-proton collisions at the LHC. The ATLAS detector will undergo a major upgrade before the high luminosity phase of the LHC. The ATLAS liquid argon calorimeter measures the energy of particles interacting electromagnetically in the detector. The readout electronics of this calorimeter will be replaced during the aforementioned ATLAS upgrade. The new electronic boards will be based on state-of-the-art field-programmable gate arrays (FPGA) from Intel allowing the implementation of neural networks embedded in firmware. Neural networks have been shown to outperform the current optimal filtering algorithms used to compute the energy deposited in the calorimeter. This article presents the implementation of a recurrent neural network (RNN) allowing the reconstruction of the energy deposited in the calorimeter on Stratix 10 FPGAs. The implementation in high level synthesis (HLS) language allowed fast prototyping but fell short of meeting the stringent requirements in terms of resource usage and latency. Further optimisations in Very High-Speed Integrated Circuit Hardware Description Language (VHDL) allowed fulfilment of the requirements of processing 384 channels per FPGA with a latency smaller than 125 ns.
Phenotype Detection in Real World Data via Online MixEHR Algorithm
Xu, Ying, Gauriau, Romane, Decker, Anna, Oppenheim, Jacob
Understanding patterns of diagnoses, medications, procedures, and laboratory tests from electronic health records (EHRs) and health insurer claims is important for understanding disease risk and for efficient clinical development; this often requires rules-based curation in collaboration with clinicians. We extended an unsupervised phenotyping algorithm, mixEHR, to an online version, allowing us to use it on order-of-magnitude larger datasets, including a large, US-based claims dataset and a rich regional EHR dataset. In addition to recapitulating previously observed disease groups, we discovered clinically meaningful disease subtypes and comorbidities. This work scaled up an effective unsupervised learning method, reinforced existing clinical knowledge, and offers a promising approach for efficient collaboration with clinicians.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Middle East > Malta > Port Region > Southern Harbour District > Valletta (0.04)
- Asia > Middle East > Jordan (0.04)
Towards Infield Navigation: leveraging simulated data for crop row detection
de Silva, Rajitha, Cielniak, Grzegorz, Gao, Junfeng
Agricultural datasets for crop row detection are often bound by their limited number of images. This restricts researchers from developing deep learning based models for precision agricultural tasks involving crop row detection. We suggest the utilization of small real-world datasets along with additional data generated by simulations to yield crop row detection performance similar to that of a model trained with a large real-world dataset. Our method could reach the performance of a deep learning based crop row detection model trained with real-world data while using 60% less labelled real-world data. Our model performed well against field variations such as shadows, sunlight and growth stages. We introduce an automated pipeline to generate labelled images for crop row detection in the simulation domain. An extensive comparison is done to analyze the contribution of simulated data towards reaching robust crop row detection in various real-world field scenarios.
- Europe > United Kingdom > England > Lincolnshire > Lincoln (0.04)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
The 6-Ds of Creating AI-Enabled Systems
We are entering our tenth year of the current Artificial Intelligence (AI) spring, and, as with previous AI hype cycles, the threat of an AI winter looms. AI winters occurred because of ineffective approaches towards navigating the technology valley of death. The 6-D framework provides an end-to-end approach to successfully navigating this challenge. It starts with problem decomposition to identify potential AI solutions, and ends with considerations for deployment of AI-enabled systems. Each component of the 6-D framework and a precision medicine use case are described in this paper.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Maryland > Prince George's County > Laurel (0.04)
- North America > United States > Florida > Orange County > Orlando (0.04)